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1.  Introduction 


in  (11  an  abstract  model  was  developed  for  the  specification  of  systolic  net¬ 
works  [2]  and  the  verification  of  the  correctness  of  their  operation.  The  model 
was  applied  to  the  verification  of  the  operation  of  four  systolic  networks  that  had 
been  suggested  in  the  literature.  In  this  report,  we  extend  this  model  to  allow  for 
networks  with  slightly  more  complicated  types  of  computational  cells,  namely  cells 
that  have  periodic  memory  or  multiplexing  capabilities. 

The  motivation  for  this  extension  is  that  we  have  to  free  ourselves  from  the 

simple  inner  product  cell  [3]  if  we  want  to  use  systolic  networks  in  a  wider  range 

of  applications,  it  should  be  noted,  however,  that  the  suggested  extensions  remain 
very  simp/e  in  structure  and  should  not  resuit  in  a  complica ted  design  for  the 
individual  cells.  Also  it  appears  that  the  most  desirable  approach  to  the  design 

of  widely  applicable  systolic  networks  is  to  utilize  a  fairly  general  generic  cell 
which  is  flexible  enough  to  be  used  in  more  that  one  systolic  network,  if  this 
generic  cell  were  to  be  controlled  by  microcode,  then  it  could  be  applied  easily 
to  the  implementation  of  the  suggested  extended  cells. 

The  model  presented  in  [1]  and  extended  in  this  report  is  similar  to  another 
model  developed  Independently  by  M.  C.  Chen  [4J.  Both  separate  the  network 

function  from  the  specific  details  of  a  certain  computation  and  allow  for  a  precise 
specification  and  a  formal  verification  of  systolic  networks.  However,  the  model  in 

I4J  is  oriented  toward  a  procedural  specification,  while  we  followed  a  more  alge¬ 
braic  approach.  We  should  also  mention  previous  approaches  [5.6]  for  formalizing 

systoiic  networks  by  means  of  a  delay  operator  [7]  and  a  notation  that  envisions 

the  flow  of  data  as  a  wave  front  propagating  over  the  network.  This  wave  front 


notation  has  been  shown  to  be  useful  in  mapping  given  algorithms  to  systolic 
implementations  (81.  However,  the  notation  does  not  seem  to  be  powerful  enough 
to  describe  the  operation  of  any  systolic  network,  especially  if  more  elaborate 
computational  ceils  are  to  be  used. 

The  extended  model  is  applied  to  the  description  and  verification  of  a  pipe¬ 
lined  systolic  system  designed  for  the  computation  of  finite  element  stiffness 
matrices.  This  represents  an  important  step  in  the  finite  element  analysis  exten¬ 
sively  used  by  engineers  and  scientists  for  the  solution  of  boundary  value  prob¬ 
lems. 


very  briefly,  the  finite  element  analysis  [9]  is  a  technique  for  solving  partial 
differential  equations  on  a  certain  domain  Q  with  given  conditions  on  the  boundary 
of  Q.  in  the  case  of  linear  equations,  it  involves  essentially  the  following  four 
basis  steps:  1)  The  generation  of  a  finite  element  mesh  that  divides  Q  into  m  fin¬ 
ite  elements.  2)  The  generation  of  elemental  stiffness  matrices  H9  and  elemental 
load  vectors  t9  for  each  finite  element  e.  e=l . m.  3)  The  assembly  of  the  glo¬ 
bal  stiffness  matrix  H  and  of  the  load  vector  f.  4)  The  solution  of  the  linear  sys¬ 

tem  of  equations  Hx=f. 

in  the  past  two  decades,  many  finite  element  software  systems  have  been 
developed  and  widely  used  (101.  However,  in  practice,  the  time  and  storage 
required  by  these  systems  to  complete  an  analysis  may  be  extremely  large.  This 

usually  imposes  severe  limitations  on  the  size  and  type  of  the  problem  that  can 
be  handled  and  often  leads  engineers  to  use  less  accurate  models  or  lower 
degrees  of  approximations.  For  this  reason,  many  researchers  have  considered 
some  form  of  parallel  processing  in  the  finite  element  analysis,  as  for  instance, 
the  use  of  array  processors  (11.12.13.141.  general  purpose  multiprocessors  (15.161. 
or  adaptive,  special  purpose  multiprocessor  systems  (17.18).  A  common  result  in 
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most  of  these  experiments  is  that  the  time  for  data  movement  and  interprocessor 
communication  is  very  large  and  sometimes  dominates  the  running  time. 

A  significant  achievement  in  this  area  is  the  design  of  a  finite  element 
machine  at  the  NASA-Langley  Research  Center  [19.20].  In  this  machine,  a  rec¬ 
tangular  array  of  processors  is  formed  by  connecting  each  processor  to  its  eight 
nearest  neighbors  with  a  global  bus  connecting  all  the  processors  of  the  system. 
Each  processor  is  assigned  to  the  computations  associated  with  one  or  more  node 
in  the  finite  element  mesh.  Of  course  the  nodes  in  the  mesh  should  be  mapped 
to  the  available  processors  in  a  way  that  reduces  the  communications  over  the 
global  bus  [21]. 

Along  the  line  of  systolic  architectures.  Law  [22]  suggested  a  systolic  net¬ 
work  to  assemble  the  global  stiffness  matrix,  and  Kung  and  Leiserson  [3]  and 
Brent  [23]  designed  systolic  networks  that  can  be  used  for  solving  the  resulting 
system  of  equations.  However,  no  attempts  have  been  made  to  use  systolic  net¬ 
works  for  generating  the  elemental  stiffness  matrices,  which  is  the  subject  of  this 
report. 


The  report  is  arranged  as  follows:  In  Section  2.  we  review  and  extend  the 
basic  features  of  the  systolic  model  presented  in  [1],  and  in  section  3.  we  give  a 
general  description  of  the  system  used  to  generate  the  elemental  stiffness 
matrices.  The  different  components  of  the  system  are  formally  described  in  sec¬ 
tion  4.  where  we  also  prove  that  the  system  Indeed  produces  the  stiffness  matrix 
corresponding  to  any  element.  In  section  5.  we  outline  a  general  technique  for 
the  formal  verification  of  the  pipelined  operation  of  any  systolic  network  and  then 
apply  this  technique  to  prove  that  the  suggested  finite  element  system  can  be 
pipelined  to  compute  all  the  elemental  matrices.  A  conclusion  indicates  some 
directions  for  further  studios. 


-  la  ' 


2.  Review  and  Extension  of  the  Formal  Systolic  Model. 


in  this  section,  we  briefly  review  the  main  features  of  the  abstract  systolic 
model  presented  in  [1].  Basically  a  systolic  network  Is  represented  by  a  directed 
graph  with  two  different  types  of  nodes,  namely  interior  nodes  and  I/O  nodes 
corresponding  to  computational  cells  and  I/O  cells  of  the  network,  respectively. 
The  edges  of  the  graph  model  the  communication  links  of  the  network,  in  order 
to  identify  the  elements  of  the  graph,  every  node  is  given  a  unique  label  and 
every  edge  is  identified  by  a  pair  (c.i).  where  c  is  a  color  assigned  to  the  edge 
from  a  finite  set  of  colors,  and  I  Is  the  label  of  the  node  at  which  the  edge  ter¬ 
minates.  The  only  restriction  placed  upon  the  edge  colors  Is  that  edges  directed 
to  the  same  node  should  have  different  colors  and  that  the  same  holds  for  all 
edges  directed  out  of  a  node. 

in  addition  to  the  graph  that  reflects  the  topology  of  the  network,  the  model 
associates  with  each  edge  an  infinite  data  sequence  which  is  the  sequence  of 
data  items  that  appear  on  the  corresponding  communication  link  at  consecutive 
time  units.  More  precisely,  let  N  and  R  be  the  sets  of  positive  integers  and  real 
numbers,  respectively,  and  set  =  R  u(0).  where  0  Is  a  special  element  called 
the  ‘don't  care*  element.  Then  the  data  sequence  Vf  associated  with  the  edge 
(y.i)  is  a  mapping  y^.N  such  that  y/it)€RQ  Is  the  data  Item  which  appears  on 
the  link  at  time  t.  If  yf(t)=0  for  some  t,  this  Indicates  that  we  do  not  care  (or 
do  not  know)  about  the  data  on  (y.i)  at  the  time  t.  We  use  the  convention  of 
denoting  the  pair  (y.i)  by  y,  and  the  associated  sequence  by  yf,  where  y  is  the 
greek  letter  corresponding  to  y.  At  this  point,  we  note  that  we  have  chosen  R  to 
be  the  set  of  real  numbers  because  of  the  nature  of  our  problem.  More  gen¬ 
erally.  R  could  be  any  set  of  items  that  can  be  transmitted  on  the  communication 


links  of  the  network. 
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Let  be  the  set  of  all  sequences  that  contain  at  most  a  finite  number  of 

non-0  elements.  Then  it  is  natural  to  define  the  termination  function  = 

o  o 

N  u(0)  with  the  property  that  for  any  sequence  y.  Tly)  is  the  position  of  the  last 

« 

non-0  element  in  y.  For  the  don't  care  sequence  defined  by  0  (r)=0  for  all  t. 

M 

we  then  have  T( 0  )=0.  We  also  define  the  zero  sequences  t  with  t(f)=0  for 

l<f<T(i)  and  any  arbitrary  large  T(i). 

The  computation  performed  by  a  computational  ceil  with  m  Input  links  and  n 
output  links  is  now  modeled  by  n  causal  sequence  operators  iytftg  . 

i=l . n.  one  for  each  output  link.  In  essence,  a  causal  operator  is  such  that  the 

tth  element  of  the  image  sequence  can  depend  only  on  any  element  /  of  its 
operands  with  i<t.  If  the  condition  /<r  Is  replaced  by  /<f.  the  operator  is  called 
'weakly  causal'.  For  the  exact  definition  of  causal  and  weakly  causal  operators  we 
refer  to  II]. 

In  order  to  model  the  computation  of  the  entire  network,  we  establish  for 
each  node  of  the  network  the  sequence  equations  describing  its  operation,  these 
are  the  equations  relating  the  input  sequences  and  the  output  sequences  by 
means  of  causal  operators.  Then,  if  possible,  we  solve  the  resulting  system  of 
equations  and  obtain  in  this  way  an  explicit  relation  between  the  network  output 
sequences  and  the  network  input  sequences.  This  relation  is  called  the  ’Network 
I/O  Description’.  Finally,  for  a  verification  of  the  operation  of  the  network  for  a 
specific  form  of  input  sequences,  we  substitute  these  particular  sequences  into  the 
I/O  description,  which,  possibly  after  some  manipulation,  yields  an  explicit  form  of 
the  network  output  sequences. 

As  the  above  review  already  indicates,  operators  on  sequences  play  a  key 
role  in  our  model.  One  way  of  defining  sequence  operators  is  to  extend  known 
operators  on  R  to  by  applying  the  operator  element-wise  to  the  elements  of 
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sequences.  Examples  are  the  sequence  addition  multiplication  and  scalar 
multiplication  Element  wise  operators,  in  turn,  can  be  classified  in  terms  of 

the  result  of  any  operation  involving  the  don't  care  element  0.  namely:  1)  8- 
reguiar  operators  for  which  the  result  of  any  operation  involving  6  is  0.  This  class 
of  operators  treats  0  as  a  ‘don't  know"  quantity,  and  consequently  the  result  can¬ 
not  be  known  if  any  of  the  operands  is  not  known.  2)  Non  0-regular  operators, 
where  0  is  treated  as  a  special  symbol  that  affects  the  result  of  the  operation. 
Example  are  the  operators  min^  and  maxfl  defined  in  (1).  in  practice,  this  class 
of  operators  can  be  used  to  model  a  network  where  the  communication  links  are 
augmented  by  an  additional  wire  to  indicate  whether  the  link  carries  valid  data  or 
not.  The  operation  of  each  computational  cell  is  then  dependent  on  this  additional 
piece  of  information. 

A  second  class  of  operators  consists  of  those  defined  directly  on  RQ.  In 

the  remainder  of  this  section  we  introduce  several  such  operators  that  will  be 

used  in  the  specification  and  verification  of  our  finite  element  system.  For  simpli¬ 
city.  given  any  operator  TlR^f-R^,  the  notation  •  •.£n)J(f>.  will  be 

employed  to  designate  the  tth  element  7>(f)  of  the  image  sequence 
j.  •  •  •  .€fl).  This  is  consistent  with  the  convention  of  using  square  brackets 

for  grouping.  We  will  also  use  the  symbol  +  for  integer  division  and  the  Fortran 

function  mod  0  that  specifies  the  remainder  of  an  integer  division. 

The  Shift  operator  fir  :  R&  -  RQ  Is  defined  by 

.8  if  r  >0  and  f<r 

tnr  iHt)  =  < 

£( f-r )  otherwise. 

Hence,  lor  r> 0.  nr  inserts  r  8-elements  at  the  beginning  of  a  sequence  and 
therefore  models  the  computation  of  a  delay  cell.  On  the  other  hand,  for  r<0.  ftr 


trims  the  first  r  elements  of  the  sequence  and  thus  is  a  non  causal  operator 
which  cannot  be  used  to  model  computational  cells.  The  role  of  the  negative 
shift  operator  is  to  provide  in  the  proofs  an  inverse  for  the  positive  shift.  More 
precisely,  for  any  sequence  Z.  we  have  n  r  nr  Z  =  Z •  The  converse  is  not 

always  true,  in  the  sense  that  ft r  Cl  r  Z  -  i  only  if  £tf)=S  for  f<r. 

The  Zero  Shift  operator  -»  R^  has  the  same  definition  as  n r  except  that 

fig  inserts  r  zeroes  at  the  beginning  of  a  sequence  instead  of  r  ©-elements.  The 
zero  shift  operator  is  useful  in  modeling  delay  cells  in  networks  that  initially  set 
the  data  on  their  communication  links  to  zero.  In  such  networks  we  must  assume 
that  the  entries  corresponding  to  the  time  t=l  in  any  non  input  sequence  are 

equal  to  0  rather  than  0. 

r  k  s  — .  — 

The  Accumulator  operator  A  :  R^  -  RQ  is  defined  to  model  a  cyclic  accu¬ 
mulator  that  starts  operation  at  time  t*r.  accumulates  a  new  element  every  s  time 

units  and  restart  a  new  cycle  every  sk  time  units.  The  Accumulator  operator  can 

r  k  s 

be  defined  in  terms  of  the  following  algorithm  that  computes  IA  '  Z ICf )  for  any 

f>0.  given  the  sequence  elements  €</>  for  /<f. 

IF  (f <r)  THEN  Mr'*'s€](f)  =  0  /*  accumulator  is  Idle  V 

ELSE 
BEGIN 

t  *  t  -  mod((t-r)  +  sk)  /*  time  of  last  reset  V 

na  -  ((t-tr)  *  s)  +  1  /*  number  of  elements  accumulated  */ 

ha-1 

IAr'"'5 £](r)  =  E  +*/>  /*  result  of  accumulating  na  elements  */ 

/=  0 

ENO 


Evidently,  this  algorithm  is  equivalent  with 


'  E  tlt+sl)  t>r 

1=0 

where  na  and  t  are  as  specified  before.  As  an  example,  let 


<  =  avbva2.b2.-'-.a7.b7.0.0. 


A 2’3'2  £  _  .#,p^  +bg  .#.P^  +tog  tbg  ,%.b^  .•.b^+bg  ,#,04+05  +bg  ,9.bj  .9.0.0.  •  •  • 

where  •  denotes  an  element  that  is  equal  to  the  preceding  one. 

w  1  wn  —  n  _ 

The  Multiplexer  operator  Mr  .  :  ^o1  "*  Ro  ls  <,eflned  t0  mo(,el 

a  multiplexer  that  has  n  Inputs  11  starts  its  opration  at  time  t=r  and 

periodically  multiplexes  its  inputs  with  a  time  ratio  of  wl:w2:*  •  *:wn.  If  the  length 

n 

of  the  multiplexer  cycle  is  denoted  by  *=  £  wa  ■  men  the  following  algorithm 

e=l  9 


defines  the  multiplexer  operator 


IF  <f  <r)  THEN  IM 


ELSE 


w  1 . wn 


(Ij.*  •  =  0  /*  multiplexer  idle  */ 


BEGIN 


f  =  t  -  mod  (it -r)  +  k) 
c 

Find  the  largest  Ineger  l<e<n 


/“  start  of  current  cycle  V 


such  that  (t-t  >  <  £  w, 

C  /=1  1 

..wl . wn,.  . 


/*  determine  interval  within  cycle  V 


(|v*  •  *  £a(f)  /*  chose  corresponding  Input  V 

s  n  9 


As  an  example,  let 


<  -  e,.e2.---.a7.e8.e9.0.0. 
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and 


V  =  ^1 *^2* ’ *  *  '  *  * 

then 

M3*2(C.t?)  =  0.0.a3.i>4.d5.a6.f)7.0.a9.0.-*  • 

it  is  also  interesting  to  note  that  the  multiplexer  operator  can  be  used  to 

model  a  de-multiplexer  cell.  For  example,  if  we  want  to  sample  the  sequence  £ 

1  r-1  * 

at  times  t=r,2r.3r .  then  we  may  express  this  operation  as  Mf‘  (£.0  )  where 

A 

0  is  the  don't  care  sequence  introduced  earlier. 

The  multiplexer  operator  can  be  used  to  define  twt  further  operators, 
namely,  the  expansion  and  the  piping  operators. 

The  Expansion  operator  models  a  cyclic  memory  that  is  loaded  at  time 

t=r  and  is  overwritten  every  k  time  units.  It  is  formally  defined  by 

7j  =  mJ . 1  (7?  .  nT?  .  n27?  .  .  nk  %>. 

which  on  the  basis  of  the  definition  of  the  multiplexer  operator  may  be  rewritten 
as 

0  t<r 

T)<f-ftf)  t>r 

where  =  moddt-r )  +  k).  For  example,  with  £  of  (2.1i  we  have 
Eg£  =  0.b1.«.#.«.b3.#.«.#.b5.r*.«.b7A.t.*.0.0.- •  • 

it  should  be  noted  that  the  accumulator,  multiplexer  and  expansion  operators 
are  weakly  causal  operators,  and  that  their  deflr. .tions  allow  us  to  model  cells 
with  memory  capabilities,  despite  the  fact  that  our  abstract  model  does  not  expli- 


citly  allow  the  nodes  to  have  memories  or  internal  states. 


Besides  the  causai  and  weakly  causal  operators  used  in  modeling  computa¬ 
tional  cells,  some  sequence  operators  are  introduced  here  for  the  sole  purpose  of 
allowing  us  to  simplify  the  description  of  data  sequences.  Following  are  two  such 
operators: 


The  Piping  operator  Pm  : 

,  1  m.  . , 

Pm  '  *  *  *  )  ~  M 

and  T(P^( n1.*  •.7?m))  = 
ments  of  each  of  the 
sequence. 


-fl  defined  by 


k . k,  1 


1 

mk . 


<7? 


rt(/-1)/r  / 

n  7? 


,  n  i)  ) 


In  other  words.  P_  concatenates  the  first  k  ele- 
m 


m  sequences  y  .  e=l ,  and  forms  one  long 


On  the  basis  of  the  definition  of  the  multiplexer  operator  it  is  easily  shown 
that  the  following  algorithm  is  equivalent  with  the  above  definition  of  the  piping 
operator 

IF (t>mk)  THEN  tpj^  (7?  \  •  •  •  ,ym))(t)  =  0 
ELSE 
BEGIN 

Find  the  largest  Integer  l<e<m  such  that  t<ek 
[P*m  (7?1.  •  •  •  ,ym)](t)  =  y9(t-(e-l)k) 

END 

0 

In  the  following  sections,  we  will  use  the  abbreviations  P&!Sf  m(y  )  for 

it  1  m  if  ir 

P  (7j  .•••.7j  ).  and  P*  (7?)  for  p"  (7?. •  •  •  .77).  As  will  be  seen  later,  the  piping 
m  m  m 

operator  is  very  useful  for  the  verification  of  pipelined  operation  of  systolic  net¬ 


works. 


n 


The  Spread  Operator  05  :  RQ  -R^  defined  by 

r-  £  (r~>  f  =  1.(s +1)+1.2(s +I)t1.  •  •  • 

I  I  to 

te5  m>  =  1 

0  otherwise 

£ 

Hence  0  inserts  s  6-elements  between  every  two  elements  of  £.  With  the 
sequence  £  of  (2.1)  we  have,  for  example 

Q2£  =  a1.0.0.b1.0.0.a2.0.0.62.- •  • 

Controlling  the  operation  of  systolic  cells. 

r  it  k  w  1  wn  _ it 

As  mentioned  earlier,  the  operators  A  '  '  .  Mr  .  and  F.  can  be  used 

to  model  systolic  cells,  where  the  Indices  r.k  and  s  control  different  timings  as  for 
Instance,  the  reset  times,  the  idle  times  and  the  active  times  of  the  cell.  One 
way  of  monitoring  these  different  timings  in  physical  cells  is  by  providing  each  cell 
with  a  separate  circuit  that  generates  reset  and  idle  signals.  On  the  other  hand, 
timings  may  be  monitored  also  by  signals  external  to  the  cell.  This  external  con¬ 
trol  method  treats  data  and  control  signals  in  a  uniform  manner  [24],  and  is 
especially  preferred  if  the  timing  signals  can  be  propagated  in  the  network  systoli- 
caliy. 


The  external  control  approach  is  equivalent  with  a  redefinition  of  the  opera¬ 
tors  where  the  control  indices  r.k  and  s  are  replaced  by  an  additional  control 
argument.  For  example,  the  expression  £m  used  in  modeling  a  periodic 

memory  cell  may  be  replaced  by  E(£.y).  where  the  nonperiodic  expansion  operator 
E  is  defined  by 
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3.  Problem  Definition  and  General  Description  of  the  System 

The  purpose  of  the  systolic  system  presented  in  this  report  is  to  generate 
the  finite  element  stiffness  matrices  H9 .  e=l  .  •••.m.  for  a  given  finite  element 
computation  based  on  a  given  mesh  on  the  domain  Q  of  the  problem,  in  order  to 
simplify  the  design  and  the  description  of  the  system,  we  assume  that  all  elements 
are  of  the  same  type,  and  hence  that  the  number  k  of  nodes  per  element  is  the 
same  for  all  of  them. 

The  class  of  problems  to  be  considered  Is  a  fairly  general  class  of  2- 
dimensional.  stationary,  elliptic  boundary  value  problems  [17].  The  ( i.i)th  entry  of 
the  symmetric  matrix  H9 .  corresponding  to  the  element  e.  l<e<m.  Is  given  by 
the  general  formula 

Hl.l  -  V/  <°r  *?*  <0,  f">  d,  «y  '•'*’■••••*  «•’> 

where  af  .  are  space  dependent  coefficients  specified  by  the  problem,  and 

«  9  .  »  . 

D,  yr  =  -= — ,  Dnt,  ~  — -  Dnf.  =  f. .  and  f.  (x.y)  denote  piece-wise  smooth 

1  ri  dx  2  1  oy  0ri  ri  / 

a  th 

basis  functions  with  the  property  that  ^  C x.y)  Is  equal  to  1  at  the  /  node  of  the 
finite  element  e  and  to  0  at  any  other  node  in  Q.  The  integration  in  (3.1)  Is 
performed  over  the  area  Q9  of  the  finite  element  e. 

Frequently  in  engineering  applications  the  coefficients  ar  r.l=  0.1.2  are 
approximated  by  piece-wise  constant  functions  on  each  element,  in  which  case  we 
may  rewrite  (3.1)  as 

»’./  =  V,  J0„  “>r<>  <D/<’  dx  «  <32> 

where  a9  /  are  constants  on  the  element  e.  To  evaluate  these  Integrals,  an  iso¬ 
parametric  transformation  [9]  Is  used  to  map  the  domain  of  each  element  Q9  into 
a  standard  element  Q  of  the  same  type  In  another  2-dimensional  space  (T.y). 
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A 


namely 

k 

x  =  E  *, 

,=i 

*  _  „  „ 
y  =  E  ^Of-y)  y, 

/=i 

where  tt(xj)  -  tj(x  (T.y).y  bTy)).  /=1. 
space  dT.y). 


(3. 3.  a) 

(3.3. b) 

are  the  basis  functions  in  the  new 


The  integrals  in  (3.2)  are  then  evaluated  numerically  over  G  instead  of  Q9 . 
Without  entering  into  the  mathematical  details,  we  give  only  the  final  formula  used 
to  evaluate  H9  y : 


2  e  « 


- 


Hl.l  "  f  P=0  ar.l  E  •o  M'  °W  W.V  “/'/VV 


(3.4) 


where  q  is  the  order  of  the  quadrature  rule  used  in  the  numerical  integration, 
tt^.y^).  g=1.***.q  are  the  quadrature  points  with  weights  w -  and  dettf(F.y>  is  the 
determinant  of  the  Jacobian  matrix  J  of  the  transformation  Qe  ~Q.  From  (3.3). 
this  Jacobian  is  found  to  be 


'l.l  J1.2 

S 

rc 

i =i 

D}ff(x.y)  xf 

E 

/ =i 

k 

D^U.V)  y/ 

k 

2.1  J2.2  _ 

E 

L/=i 

E 

/=1 

Because  of  the  regularity  of  the  standard  element  ff.  we  can  easily  write 

_  _  _  d*,  _  _  d*, 

the  formulas  for  f.ix.y)  and  Its  derivatives  0^  ^  and  Dpt/  Then  the 

d7  3y” 

derivatives  Df f..  r*i,2  and  /  =  1 . *  *  * ,#c  used  in  (3.4)  may  be  obtained  from  the 

transformation 


-i 


L^£_ 


Pi*,] 

=  J-T 

[V,] 

(3.5)  1 

Pi*, 

*2*/. 

-J 

where  J  T  is  the  inverse  of  the  transposed  Jacobian  matrix  . 

it  should  be  noted  that  the  quadrature  points  and  weights  as  well  as  the 

-  -  a*/  -  -  d*i 

basis  functions  and  their  derivatives  = — —  and  D2ft  =*—3  do  not  depend 

dx  3y 

on  the  specific  finite  element  that  is  to  be  processed.  Hence,  they  may  be  com¬ 
puted  at  the  quadrature  points  (x^.y^).  and  pre-loaded  into  the  system  before  it 

starts  its  operation  which  allows  for  their  repeated  use  during  the  calculations  of 

0  _  _ 

H  for  e=l. •  •  •  .m .  On  the  other  hand,  the  derivatives  and  D2fj  in  (3.3) 

have  to  be  calculated  for  each  element  using  (3.5). 

0  _  —  r 

We  denote  by  (g)  the  value  of  the  basis  function  t^g-Yg^  and  by  A^fg) 
and  v^(g).  r=l,2.  its  derivatives  Dr  tfigjg)  and  Of  t/bgjg).  respectively.  Sup¬ 
pose  further  that  (x®.y®),  /=l .♦♦*,*.  are  the  coordinates  of  the  k  nodes  in  the 
finite  element  e.  Then  the  following  algorithm  computes  the  elemental  stiffness 
matrices  He  for  •*!.•  ••.m.  (The  steps  NT  through  N5  in  the  algorithm  are  par¬ 
titioned  in  a  manner  needed  for  the  description  of  our  systolic  system). 

Algorithm  ALG1 

INPUTS 

1)  (  Aj(g).  A^(g)).  g=l.***.q  and  f=l 

2)  For  each  finite  element  e=l.  -  •  *  .m 

2.1)  (x®.y®).  f=l.*  •  *.k 

2.2)  a®/(  r.l- 0.1.2 

For  each  finite  element  9  =  1.  •  •  •  .m  DO 

Nl)  For  each  quadrature  point  g  =  l .•••.q  compute  the  Jacobian  of  the  iso¬ 
parametric  transformation  from 


N2)  For  p=l,  •••.<7  compute  the  temporary  quantities 


t]  ( g )•  •  ‘T}k  ig) 
(g  )  •  •  •7^  (g) 


fJ22(g)  -J2tl<o> 

,T.2(g)  Jhl(g) 


f  a](0)-  •  'A^(g) 


A*(g) 


N3)  For  0  =  1.*  •  •,<?  DO 

N3.1)  det(g)  =  J1  }ig)  J22W  -  J12<0>  J2.1(g) 
1 


N32)  m  d«W>  jrl(g)- 


r- 1.2.  /  =  1.- 
r=0,l,2.  /  =  !. 


det  ( g ) 

N3.3)  vj’tp)  =  wg  detig)  v^ig). 

N4)  For  /=1.  •  •  •  ,/c  compute  the  approximate  integrals 

N4.1)  For  /=!.*  •  -,/-l 
<7 


/*'  = 

1.1 


E  */<«> 

7=1 


.1. 


N4.2)  /'I  =  E 

p=i  '  1 


N5)  For  /  =  1.*  •  •.*  DO 
N5.1)  For  /=!.•  • 


./- 1 
2 


.1 


'l.i 


=  E  E  ^  / 

r=0  I *0  r,/ 

2  2 


e 


N5.2)  HJ  -  2  z  E  (c  ,  a  . 

r=0  /  =r  r*'  r*' 


7 1 .1 


\2«” 


r,/ =0.1.2 

r=0,1.2.  /=0. •  •  •  ,r 


where  cf  f  equals  to  1  If  r#/.  and  to  0.5  If  r=/. 


Figure  3.1  shows  a  block  diagram  of  the  systolic  system  that  executes  this 

algorithm,  it  consists  of  a  local  memory  LM  to  store  the  pre-loaded  values  of 

0  1  2 

vfig).  Ajig)  and  Ajig).  g- l.***.q.  and  five  systolic  subnetworks  N1***N5  that 
are  arranged  in  a  cascade  such  that  the  output  of  a  sub-network  is  an  input  for 
a  following  sub-network.  Each  sub-network  is  designed  to  perform  the  compute- 
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tion  in  the  corresponding  step  in  ALG1. 

in  order  to  compute  the  matrix  He  for  a  certain  element  e.  the  coordinates 
of  the  nodes  (x®.y®),  /=!. •••.*.  and  the  coefficients  a®  r.l=  0.1.2.  for  that  ele¬ 
ment.  are  fed  to  the  system  via  subnetworks  N1  and  N5.  respectively.  The  entries 

/=!.•••.*.  /=1  .•••./.  of  the  symmetric  matrix  H®  are  then  obtained  from 
the  sub-network  N5  after  a  delay  period  of  (q+3k+16)  time  units,  where  a  time 
unit  is  the  maximum  time  needed  by  any  computational  cell  in  the  system  to  per¬ 
form  its  operation.  This  is  basically  the  time  required  to  perform  a  Multiply/Add 
operation,  or  a  division  whichever  is  larger. 


Although  this  is  a  noticeble  speedup  of  order  k  over  the  serial  execution  of 
algorithm  ALG1.  the  real  advantage  of  the  system  lies  In  the  possibility  of  pipelin¬ 
ing  the  computations  of  the  stiffness  matrices  for  e=l.***.m.  and  of  obtaining 
one  matrix  every  3k  time  units.  Of  course,  we  also  obtain  the  advantage  of  a 
non-conflicting  and  smooth  data  flow  in  the  system  which  greatly  reduces  the 
memory  fetch  times. 


8 


Finally,  note  that  we  have  assumed  that  there  is  only  one  variable  at  each 

node,  that  is  the  degree  of  freedom  per  node  is  unity,  in  the  general  case  of  d 

degrees  of  freedom  per  node,  the  constants  a®r  r./ =0.1.2  are  d*d  matrices,  and 

consequently  each  entry  H® ^ .  f./=l. •  •  •,* .  in  the  elemental  stiffness  matrix  H®  is 

2  e 

a  dxd  submatrix.  To  compute  the  d  elements  of  .  without  slowing  down  the 

2 

system,  we  replace  the  subnetwork  N5  by  d  identical  subnetworks,  each  of  which 
generates  the  corresponding  entry  in  the  submatrix  H®  when  provided  with  the 
appropriate  entry  in  the  dxd  matrix  a? 

»  »f 


I 
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4. Formal  Description  of  ttie  System's  Components 

in  this  section,  we  describe  the  architecture  of  the  five  subnetworks 
N1.***N5.  that  execute  the  corresponding  steps  in  algorithm  ALG1.  Moreover,  we 
will  derive  the  I/O  description  of  the  individual  subnetworks  and  prove  that  the 
system  generates  an  elemental  stiffness  matrix  if  appropriate  input  data  are  pro¬ 
vided. 


it  should  be  clear  that  alternate  designs  for  the  components  of  the  system 
may  be  given.  However,  one  advantage  of  the  system  described  in  this  report  is 
its  flexibility  in  the  sense  that  only  minor  modifications  are  needed  to  use  the 
system  for  different  values  of  k  (element  type)  and  q  (quadrature  formula).  More¬ 
over.  our  primary  goal  is  to  demonstrate  the  effectiveness  of  the  formal  model  for 
a  precise  specification  and  verification  of  systolic  networks  with  computational  cells 
more  complicated  than  those  of  the  simple  Multlply/Add  type. 


4.1.  The  Subnetwork  N1 


The  graph  of  the  systolic  network  N1  Is  composed  of  2q  internal  nodes  as 
shown  in  Figure  4.1;  each  node  is  labeled  by  two  integers  (i.g)  1*1.2  and 

g=1 q.  where  q  is  the  number  of  points  used  in  the  numerical  integration  (3.3). 

The  graph  also  shows  the  color  assigned  to  each  edge,  namely  r.  p  or  z. 

Each  interior  node  (i.g)  represents  a  computational  ceil  whose  operation  is 
described  by  the  causal  relations 


^l.gt  1 
p/+l.g 


"  </.g 

n  p/.g 


..3k -2. 1.1  .  w 

n  Mgt/-1  ni.g  1 

where  s*1  for  1*2  and  s*3  for  i*l.  and 


^/tl.g  . g- 


(4.1. a) 
(4.1.b) 
(4.1.0 


Figure  4.1  -  The  graph  for  Nl. 


Figure  4.2  -  The  structure  of  a  typical 
cell  (i,g)  in  Nl. 


*  W 


(4.2.a) 

(4.2.5) 


The  graph  in  Figure  4.1  and  equations  (4.1).  (4.2)  specify  Nl  completely, 
in  order  to  analyze  the  internal  structure  of  each  cell  (i.g)  more  closely,  we  first 
note  that  equations  (4.2.a/b)  indicate  that  a  ceil  should  contain  a  multiplier  and 
two  accumulators  (see  Figure  4.2).  The  accumulators  start  operating  at  times  g+i 
and  g+i+1.  respectively,  accumulate  the  output  of  the  multiplier  every  third  time 
unit  and  are  reset  to  zero  every  3k  time  units.  The  content  of  these  accumula¬ 
tors  at  consecutive  time  units  is  expressed  by  the  sequences  x.  and  T  As 
is  clear  from  equation  (4.1.c).  each  ceil  contains  also  a  multiplexer  that  starts 
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operating  at  time  g+i-1  and  multiplexes  the  input  nj  and  the  contents  of  the 
accumulators  with  a  time  ratio  of  3k— 2: 1:1.  the  delay  element  ns  is  introduced  in 
Figure  4.2  under  the  assumption  that  the  elements  A  and  M  do  not  consume 
any  time.  In  practical  implementations  however,  these  elements  do  consume  some 
time  and  consequently  the  element  labeled  n  has  the  function  of  a  synchronizer 
rather  than  a  latch. 


After  having  described  the  architecture  of  the  network,  we  prove  the  follow¬ 
ing  proposition  that  gives  the  I/O  description  for  Nl.  which  is  an  explicit  relation 
between  the  network  output  sequences  p3  g.  <;=!.*  ••<*.  and  the  network 

input  sequences  C/  ^  p1  ,  /  =1.2.  .♦••q. 

Proposition  Nl.l  :  I/O  description  of  the  network  Nl.  For  p  =  the  follow¬ 

ing  relations  hold; 


(4. 3. a) 
(4.3.b) 


.  .g-ti.k.3  ._/- 1  A  _o-l„  , 

■ A  m  pi„  n  «/.i> 

—  Ag+/tl.>r.3  ._/ -1  ,  o -1  , 

V,  * A  m  ®1.»  n  «/.!> 


Proof:  To  prove  (4.3.b).  we  first  note  that  (4.1.a/b)  have  the  solutions 


«/.i 

J- 1 

p/.p  *  n  pi.g 

Then  from  (4.1.c)  we  obtain  for  g  =  l.-**.q  that 


3k -2  1  1  _ 

*3.g  *  n  ^gtl  '  '  (*2.g  '  X2.g  •  H.J 


(4.3.0 
(4. 3d) 

(4. 4. a) 


(4.4. b) 
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where  ^  and  are  as  given  in  (4.3.c)  and  (4.3.d).  respectively.  Using  pro¬ 
perty  P5  from  the  Appendix,  this  may  be  rewritten  as 


-  «  ,kJ3k-2. 1.1  3  3  3- 

3.g  =  0  "gtl  (/Wgt3  m  *l.g  •  n  X1.g  •  n  Xl.g>  •  X2.ff  ■  X2.g) 

Finally,  we  obtain  (4.3.b)  by  applying  property  PI 3  from  the  Appendix.  Equation 


(4.3. a)  results  directly  from  (4.4.b).l 


in  order  to  perform  the  calculations  in  step  NT  of  ALG1  for  a  certain  finite 
element  e.  l<e<m.  the  Input  sequences  must  be  described  by 
* 

V}  g  =  o  (4. 5. a) 

!  =  n/_1  £*  te2  /  =  1.2  (4.5.b) 

3  k  1119  9  99 

Pl.g  =  n  P2  (/M1  (0  *g.O  ’  n®  *0.1  '  0  9  (4.5.0 

where 


ment  e.  and  fg  Q,  <pg  1  and  <fig  2  contain  the  shape  functions  and  their  deriva¬ 
tives.  A  pictorial  representation  of  these  Input  sequences  in  the  case  k-3  and 


q=3  is  provided  in  figure  4.9  using  a  time  diagram  in  which  the  elements  of  the 
different  sequences  at  consecutive  time  units  are  displayed. 
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Proposition  N1.2  With  the  inputs  (4.5).  the  outputs  of  the  network  N1  are 
described  by 


n°  *'  'f<’1<e2Vo 

n»+3*-l  ^ 


where  T (/3e)  =  4  and  /8e(f)  =  J®  .j(g). 
and  4.  respectively. 


,  ne2>pff  ^  ,  f)2©2^  2>)  g=l.***.qi  (4. 6. a) 

5=1.*  •  •  ,q  (4.6.b) 

J®2(g).  and  J2  2ig)  ,0r  t=1'  2*  3 


Proof  :  The  proof  of  (4.6. a)  follows  directly  from  (4. 3. a).  To  prove  (4.6.b).  we  first 

3  k 

note  that  the  operator  Pg  In  (4.5.C)  indicates  that  the  first  3k  elements  of  the 
argument  are  repeated  twice  in  .  This  repetition  is  only  necessary  for  the 
operation  of  the  subnetwork  N2.  and  will  not  be  considered  here.  Hence,  we  will 
replace  the  last  3k  elements  of  the  repetition  by  don't  care  elements,  which 
reduces  (4.5.c)  to 


=  n<,~’  '°2 


<e‘  Vo  • n  6,2  Vi  •  n*  ®2  Va’ 


(4.5.e) 


Now  substitution  of  the  input  sequences  (4.5.a/b/e)  into  the  I/O  description 
(4.3.b)  results  in 

n3.g  =  n  Mgt 3  (0  '  X2.g  '  X2.g  '  n  Xl.p  *  n  Xl.g)‘  (4. 7. a) 

Here  by  (4.3.c)  and  the  definition  of  the  E  operator  and  properties  Pi  and  P7  we 

find  that 


y„  •  A9"-*-3  m9t'"2  .0  •  ne2^  ,  .  nV*s  y  -  n 

.e,  — 2,  _  .  ,e,  J2J2. 


•  n9+'~2  *2'*'3  A*]'1'1  (e2V0(O'«7]  '  ne‘[,pg.l*^]  '  n^[,pg.2*^J) 
and  by  PI 4  that 


9+'  ~2  E3e2€®l 

.e. 


.  rt0+'-l  « 1  »k  .3  «2  .  ,  ,e. 

X/.g  =  A  0  */J 

Similarly,  we  can  show  that 


(4.7.b) 


—  jg+\  .1,k,3  J2  ,  ,  ,e, 

X/  ,g  "  0  *  0  ^g.2  */] 


(4.7.0 
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For  a  further  simplification  of  the  equations  (4. 7. a),  we  consider  the  defini¬ 
tion  of  the  multiplexer  operator  with  the  restrictions  (4.5.d)  on  the  involved 
sequences.  This  gives  for  g=1.-**.g 


X2.g(fft3*_1)  =  [n°+1  q2  A1'**1  1  *  ^2 

=  tAhk,10pg  !  -  UOc) 

k  k  .. 

=  E  ,(/)  *!(/>  *  E  x®  =  J®  (g) 

/=1  01  /=!  1  ' 

where  0®  1  (g)  is  specified  in  algorithm  ALG1. 

By  a  similar  argument,  it  can  be  shown  that  £®(2),  /8®(3)  and  /30(. 4)  are 
equal  to  J®2(g).  0®  1  (g)  and  J®2<g),  resPective,y'  which  proves  the  proposition 
and  shows  that  the  network  performs  successfully  the  calculations  In  step  TM1  of 
ALG1  for  one  finite  element  e.a 


4.2.  The  Subnetwork  N2 

The  graph  of  the  subnetwork  N2  is  composed  of  q  identical  rows  g=l.***.g 
(see  Figure  4.3)  where  each  row  consists  of  three  Interior  nodes  U.g).  i  =3.4.5. 
The  edges  are  given  the  colors  p.r.s  and  s  as  shown  in  the  figure. 


Figure  4.3  -  The  graph  for  N2. 


Figure  4.4 


For  a  given  row  g.  the  computation  of  a  cell  may  be  described  as 


follows: 

For  cells  <3.g) 


3 

n.  =  n  ti  ~  _ 

4.g  3.g 

a  .3.1 

°5.g  *  n  1  PS.„ 

For  cells  (4.g) 


*4.9  '  "  »3.9 
«9  ' M  <  ^3  »3.„ 


^‘2  ‘-"3.S1 


A 

0  )] 


(4. 8. a) 


*6.0  =  n  *4.0 

5s.„  -  n  AS,,'3'1  ">4.9 

For  cells  (5.g) 


P5.0  =  n  p4.0 


M 


1.1.1 

P  +  1 


(  E' 


.3  k 
0  +4 


‘"W  •  £9t3  "4.9  •  8'» 


(4.8. b) 


From  the  above  specifications,  it  is  clear  that  cells  (3.g)  and  (4.g)  have 
identical  structure  (see  Figure  4.4)  and  differ  only  in  the  reset  times  of  their 
accumulators,  multiplexers  and  memories.  To  reset  these  elements  at  the  proper 
time,  external  reset  signal  can  be  propagated  in  the  network  as  explained  in  Sec¬ 
tion  2. 


Proposition  N2.1  :  The  network  I/O  description  of  N2  is  given  by 


Vg  *  n"  "3.0 

p7.g  *  n  "otl  01  p3.o  '  X3.g  '  Vg’ 


g- 1.  •  •  •  ,q  (4. 9. a) 
(7  =  1,  •  •  *,<7  (4.9.P) 


where 


‘3.0  *  "2  [V.g 


3  if 

Eg+3  ^.g1  "  n  ^3,g 


^2  V.,1 


‘4.8  =  n  In  Vo  '  ^t3  "Vo1  -  "2  <"  Vo  •  <Vt4  ""  "3.»> 


(4.9.C) 

(4.9.d) 


Proof  :  Equation  (4.9.a)  is  trivial.  In  order  to  prove  (4.9.b>,  we  begin  by  applying 
property  PI. 3  to  equation  (4.8. a): 

o,_  =  n  4"  M  (p  "El.-  „  ,  p,  „  *  E  _  [-7T.  J  .  0  ) 

3.0  g  3.g  g+3  3,g  3.0  0+2  3,0 

M 

Then,  we  apply  property  P14  and  use  0  to  replace  sequences  whose  values  are 
irrelevant  to  our  analysis.  This  gives 


E3*  *  tt„  j  -  tp,  _  *  E3*  „  J  .  0  ) 


1.1.1  * 

Or  _  *  fl  M  (0  ,  fl  Ipn  _  *  *-  /j  «  q  J  -  lK>n  _  1-  0  #/  n 

3.0  0  3.0  0+3  3.0  3.0  0+2  3.0 

which  from  PS  may  be  written  as 


Vo  =  W  <S'  •  ‘3.8  •  o'* 


(4. 10. a) 


where  x3  is  as  described  by  (4.9,c).  Similarly  from  (4.8.b)  we  obtain 


—  1  1  l  *  ' 

a5.0  =  Mg\  1  (0  •  0  *  X4.0> 


(4.10.0) 


where  X^  is  as  described  in  (4.9.d).  Finally,  substituting  (4.10)  in  (4.8.0.  and 
using  P13  we  obtain  (4.9,b).  which  completes  the  proof  ■ 
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The  input  links  of  N2  are  directly  connected  to  the  outputs  of  Nl.  and 
hence  the  input  sequences  g  and  p3  g  =  l.*'*,g  are  described  by  the  formu¬ 
las  (4.6). 


Proposition  N2.2  :  If  the  inputs  to  N2  are  given  by  (4.6).  then  its  outputs  may  be 
described  by 


-  n9 t3* t3  f 

p7(j  =  n»t3*t3  A*]1'1  •  06%,,  .  n2e%2> 


Where  T(fg  ^sT(tg  2 >=*  and  tg  (g).  tg  2(f)=7^(g)  with 

as  specified  in  algorithm  ALQ1. 


g  =  1.*  •  •  ,q  (4. 11. a) 
g=l.  •  •  •  .q  (4.ll.b) 
T*  (g)  and  Tf  Cgr) 


Proof  :  The  proof  of  (4. 11. a)  is  trivial.  In  order  to  prove  (4.1  l.b).  we  will  ignore 

the  value  of  the  first  3k+g+l  elements  In  the  input  p*  .  and  hence  rewrite  (4. 6. a) 

o.g 

as 


„  g  +3k  + 1 

P3.g  =  n 


a.  o 


2 

ne  v> 


2.1 


2.2 

n  9  ^g  ,2> 


(4.6  c) 


In  order  to  find  the  output  sequences  p7  .  we  obtain  an  explicit  description 
for  x3  g  and  X4  .  by  substituting  the  input  sequences  into  (4. 9. c/d),  indeed, 
from  (4.6. b/c)  it  follows  that 


p3.»  *  C*  <9%.o  •  oe%,,  .  oVV2,  - 


We  then  interchange  the  shift  and  expand  operators  using  P6  and  apply  PI 7  to 
get 


p3.„  *  £»‘3  "3.«  *  o9t3*”  m2  <a\  0  .  ne%  ,  .  n2e%  2>  • 


2_2 


=  n"  (n  n  m,"  te  n  ,  ne  ,  ,  n  e  <p„  j  .  /?  (4)i 


,2+3*-l  frt3„-l  tJ.l.l.e2 

fl  (e  rg  ,0 

g t3k  +  1  ..1,1,1..'  __2  .  ,e  ,  ,  ,  * 

=  nw  (6  .  ne  u/22(0)  •  ^  ®  * 


2  .2 


■  2  8 


where,  as  usual,  the  sequences  irrelevant  in  this  context  were  replaced  by  0 
Similarly,  we  obtain 


,  c3*  ,0+3/c  +  l  „1.1.1*  ,*  ,.2-2  .  ,e 

P^.g  £p+  2  n3.g  =  n  <0  .  0  .  ne  [J21(fl)  .  <pg  2d 


and  thus  derive  from  (4.9.c)  that 


g  +3k  +3  -.1.1.1,-.*  -~2  ,  ,e  ,  .  „  ,  * 

fi  Af^  (0  .  n©  Ug  2^^  •  iPj  ^1  .  fl  )  • 

0pt3A+2  ^l.l.l'  *  * 


111  *  *  2  2a 

Afj'  '  (0  .  0  .  fV  ©  •  'Og  2l) 


-g+3/(+3  ..-1.1.1,,*  -«2  f.e  ,  .  „  ,  * 

=  f r  t M ^  (0  .  fl©  CJg  2  (fl)  .  <fig  .  0  )  - 

.  ne2  U®  !  <C>  -  Vg  2]  •  °*)1 

_pt3/(+3  .,1.1.1,,*  -_2  f.e  ,  .  _  .e  ,  .  „  ,  * 

—  fl  A# -J  (0  ,  ft©  CJg  2  *0  *  *  Vg  ^  "  >^2  l  ®  2^  ®  1 


By  a  similar  analysis  It  follows  that 


*  _flf t3/c  +3  ,,1.1.1  „*  ,*  J2J2  , ^  .e  .  ,  „  „ 

A^^  =  n  CO  .0  .fl©  i  (pi  •  'Pg  2  “  2  ^ 

Finally,  we  substitute  into  <4.9.b>  the  computed  values  for  X3  g  and  \4  g 

together  with  the  input  sequence  p,  _  and  apply  properties  P5  and  P13  to  obtain 

o.g 

a  t3/c  t3  ^2^2 

P7.g  s  n  M1  <0  V<>  '  n®  Vl  '  0  8  r0.2  9  =  1.  •  •  •  .<7 

where 


V,"’  *  4.2'®’  •  Vl"’  -  4.1'®’  •  V®"’ 

=  4.2'®’  4((®>  *  4.i '»’  ®f'®’ =  r’'®’ 

V®"’  ’  4,1 '®‘  V®"’  '  4.2'®’  Vl*’  *  ’f'®’ 

This  proves  explicitly  that  the  output  sequences  p7  contain  the  results  of  step 


N2  in  ALQ1. 
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4.3.  The  Subnetwork  N3 


As  in  the  case  of  N2.  the  subnetwork  N3  is  composed  of  q  independent, 
identical  rows.  Each  row  performs  the  calculation  corresponding  to  step  N3  in 

ALQ1  for  a  certain  value  of  g.  l<g<q.  Due  to  the  variety  of  possible  designs 
and  to  the  simplicity  of  the  network,  we  will  not  describe  N3  in  any  detail, 

instead,  we  will  assume  that,  with  the  inputs  described  by  proposition  N 2.2.  N3 

takes  five  time  units  to  complete  its  computation  and  to  produce  for  any  g. 

l<g<4  the  outputs 


ng  +3/c+8 
na +3 k +8 


1.1  1  2  -  2  -  2  2  — 
wi  (e  Vo  *  ne  V l  •  n  0  vg.2)  (412a) 

1.1  1  2  2  2  2 
M1  (e  Vo  •  n0  Vi  •  n  6  pg.2>  (412b) 


where  vg  rit)  =  v^(g).  rit)  -  v£(g).  and  the  values  of  rffg)  and  are  as 

given  in  step  N3  of  ALG1. 


4.4.  The  subnetwork  N4 


In  this  subsection,  we  describe  a  network  that  completes  the  numerical 

r  i  **  =r  l 

integration  by  computing  the  quantities  Y  ‘  *  £  V.  v.  for  the  ranges  of  the 

'  g*l  1  1 

indices  in  the  corresponding  step  of  ALG1.  The  subnetwork  is  described  by  the 
graph  in  Figure  4.5  and  the  node  I/O  descriptions  of  a  typical  interior  node  (l.g) 
i- 9. •••,8+3k.  g  =  1. ••*.<7.  are  given  by  the  causal  relations 


(4.13  a) 
(4.13.b) 
(4.13.C) 


As  this  description  shows,  each  cell  latches  the  p  and  r  data  streams  by 
two  and  one  time  units,  respectively,  it  also  performs  a  Multiply/Add  operation 


and  puts  the  result  on  the  z  output  link. 


Proposition  N4.1  :  The  I/O  description  for  N4  is  given  by 


<<.,♦1  ■  n"  "  E  in'-9  „95  *  p9ei 


/=9.  •  •  •  .8t3/c  (4. 


*i»  »  |^irf7  i 


UA _ ( 

[-L  JH 

uT"1 

J — ■ 

- 

' — - KjJ 

— w 

T  T 

,  ,  .  z  z 

/» -  MSB  - ^  / - - ✓  /« - - 

Fi gure  4 . 5  -  The  graph  for  N4. 

Proof  :  To  prove  this  proposition,  we  first  write  the  solutions  of  (4. 13. a)  and 
(4.13.W  In  the  form 


20-9) 

it i  g  *  fl  ™9  q  Is 9,  •  •  •  ,8+3fc ,  p=l.  •  •  •  .<7 

0-9) 

P/  g  *0  Pg  p  /=9,  •  •  •  ,8+3k,  0  =  1.*  •  *.<7 

and  then  substitute  them  into  (4.13.C).  This  gives 


^(/  -9) 

p/.g  =  n  p9.g 


,  _  _  ^  20-9)  _  .  J -9 

*/.g  +  l  "  n  ^/.g  +  n  ^9,g  n  p9.p] 


(4.15) 


By  Lemma  1  in  the  Appendix,  the  solution  of  (4.15)  for  a  fixed  i.  9</<8+3k  is 
then  found  to  be  identical  to  equation  (4.14).  which  completes  the  proof  s 
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in  order  to  perform  the  computation  in  step  N5  of  ALGl.  the  input  links 
z.  ^ .  is 9.  •••.8+3k  should  be  permanently  set  to  zero.  That  is  to  say.  in  the  I/O 
description  (4.14)  we  must  set  {..  1  =  t.  where  t  denotes  the  zero  sequence  of 
section  2.  With  this,  we  rewrite  (4.14)  as 


/  -  r  o/-8+<7"0  tn1'9  it 

0  =  1 


P9.pJ 


/  =9.  •  •  •  ,8+3k  (4.16) 


The  next  step  in  the  verification  of  N4  is  the  calculation  of  the  output 
sequences  for  a  specific  form  of  the  Input  sequences  ^  and  pg  .  As  Figure 
3.1  shows,  the  outputs  of  N3  are  inputs  to  N4.  and  hence  ^  and  pg  are 

described  by  the  formulas  (4.12).  Unfortunately,  it  is  not  at  ail  simple  to  find  an 
explicit  description  of  the  output  sequences  for  this  specific  input,  in  order  to 
simplify  the  equations,  we  will  replace  the  index  i.  9</<8+3fc  by  /'=9t3u+v.  where 
the  indices  u  and  v  vary  In  the  ranges  0<u<A-l  and  0<v<2.  More  descriptively, 
we  divide  the  3k  columns  of  N4  into  k  groups  of  3  columns  each.  Thus,  we 
rewrite  the  network  description  (4.16)  as 
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,  v'  3u+v+w  ,rt3uf-v.,  1,1.1  ia2— 

Cu.v.q +  1  -  E^n  in  M,  (6  v 


frt3utv.,l.l.l  ,_2—  „„2—  2  2— 

in  m,  (6  Vfl  0.  ne  Vgy  n2e2vff  2> 


■  "!','1<eS.O  '  ne%.l  •  n2®2,/„,2>  1 


.  3u  tv  tw  £ 


,‘Vv.o  '  "!  ''1(eS.O'  neS,r  n2®%.2>  1 


(4.19) 


where  w=qt3kt9  and  Xu  y  is  found  by  properties  P4  and  P5  to  be  equal  to 


wi  1  1  (e2n°  vg  Q  .  ne2nu  vg  l  .  nVn"  Vg  2> 


if  v=0 


Vv.fl  s  Mi'1'1(e2n0+1  2  .  ne2nu  vg  Q  .  n2e2nu  7  ,)  if  v=i 


Mi  1  1  (e2nw+1  vff  #1  .  ne2nu+1  vg2  .  n2e2nu  vg  Q) 


if  v=2 


The  result  (4.18)  is  then  obtained  by  first  applying  Pi  to  perform  the  multi- 


+v 

plication  in  (4  19),  then  by  pulling  n  out  of  the  M  operator  with  the  help  of 
P5  and  by  applying  the  summation  to  the  arguments  of  M  (property  PI. 2).  As  an 
illustration  of  the  derivation  procedure,  we  consider  the  case  v=l  for  which  we 


.  _3utl+w  2  .J.l.1,^2.  t/+l  — 

^u.l.q  +  l  n  E  W1  i  (0  [n  v  *  vj,o1  * 

g =1 

n®2[n"  Yo  '  -2.il  •  n2®2""  -,.l  '  -2.2» 


*  n3utUw  »  .  ne2n“  .  n2e2n"  ,;  2> 

20  01  12 

where,  from  Pi,  T(t)u  )  =  k-u- 1,  T(tj” )  =  7 (7)^  )  =  k-u  and 


on  ^  ^  _o  n  on 

V  =  *  lV9.2a>  “  Vp.0(f-U-1>J  =  E  ^?<9>  V^.^a))  =  V'f-u-, 

9-1  9=1 

C®1  '  E,  tv  0(f>  *  »  ,«-»))  -  E  (5?<b)  v1  <2)1  = 

9=1  9  =  1 

li'2®!  -  E  (v  1»>  ■  2«~ vll  =  E  tv 1  <0 >  V?  <2)1  =  ^ V2-u 

9=1  9=1 


Finally,  we  apply  PS  to  get 

,  ^2(3ut1)i 


,_2(3t/t1)+w  ..1,1.1  ,2  0.1  2  1.2  2  2  2  0 

*  n  "i  16  iu  •  ne  V  •  nV 

which  is  a  special  case  of  (4.18)  for  v=l.  The  cases  v=0  and  v=2  are  proved  in 
an  exactly  similar  way.B 
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Equation 

(4.18) 

shows  that  the 

output  sequences  y  .j 

contain 

the 

results  of  the 

numerical  integration 

needed  for  the 

calculation  of 

the 

stiffness 

matrices  in  the 

next  subnetwork  N5. 

It 

also  specifies 

precisely  the 

time 

of 

each 

output  data  item.  In 

figure  4.6. 

this 

specification 

is  translated 

into 

a 

time 

diagram,  where 

we  plot 

the  elements 

of 

Cu  y  q+1  versus  time  for  the 

special 

case 

of  k=3  and  q=3. 

4.5.  The  Subnetwork  N5 


The  network  N5  is  composed  of  three  different  rows  (see  Figure  4.7).  Row 
q+1  contains  3k  identical  nodes,  it  receives  the  constants  a®  f  on  the  links 
Pgq+y  rg  gfi  and  599  +  1  and  dlstrlbutes  them  appropriately  on  the  b  colored 
links  such  that  each  integral  Y^'j  appearing  on  a  z-coiored  link  meets  the 

corresponding  constant  a®  y  at  the  right  time.  Row  q+2  also  contains  3k  identi¬ 
cal  nodes  and  computes  the  partial  sums  Ur.  =  £  a®  /'(  and 

•'  /=  o  r'  '' 

2  e  si 

L  (cf  i  8r  j)  rj  i  for  l*i  and  1=1.  respectively,  where  cf  f  is  as  given  in  ALG1. 

e  2  r 

Finally,  row  q+3  contains  only  k  nodes  that  complete  the  sum  hr  ~  £  u 

r=0  1,1 

The  edges  of  the  graph  are  given  the  colors  p.  r.  s,  b.  z.  z°.  z1  or  z2  as 

shown  in  figure  4.7.  Note  that  we  used  three  different  colors  z®,  z1  and  z2  to 

satisfy  the  restriction  that  no  two  edges  ending  at  a  node  have  the  same  color. 
To  simplify  the  analysis,  we  consider  each  of  the  three  rows  separately. 


We  consider  first  the  row  q+1  in  which  each  cell  simply  latches  the  four 
data  streams  z.  p.  r  and  s  by  one  time  unit,  and  selects  the  output  on  the  b 
link  to  be 


n  lhi  •  */.g  + lj 


^/.g  +  1  =  j  n  pi.q  + 1 


'l.q  + 1 


If  /=9+3u.  u=0, 


if  /  =9+3u  +1 ,  u  =0.  •  •  -  .*-1 


If  /'=9+3u+2.  u=0. 


where  hj  =0.5  for  /=9  and  h.  =1.0  for  />9.  The  factor  0.5  is  needed  to  imple¬ 
ment  step  N5.2  In  ALG1.  where  only  the  rf'f .  Hr  are  explicitly  available  for  the 
computation  of  H®  / .  while  we  have  for  />r 

For  the  proper  operation  of  the  system  the  input  sequences  should  be 
described  by 

*9.g-t-l  =  n  Pk(a0} 


P9.qtl  =  n  Pl(a^ 
a9.gtl  =  n  Pk(a2) 


(4.20) 


where  for  /=0.1.2.  T (ay  =  3  and  aXf)  *  **_iD2/  r_-j  •  with  D  denoting  the  modulo 

3  addition  operation.  More  descriptively,  we  input  on  each  line  three  of  the  con- 

B  3 

stants  ar  r./=o.l,2.  repeated  k  times  as  indicated  by  the  piping  operator  (for 

more  details  see  Figure  4.9). 


Using  the  two  indices  0Cu<k-l  and  0<v<2  as  in  the  previous  subsection, 
and  noting  that  the  Input  j  is  given  by  (4.18),  we  can  easily  show  that 

cu.v.gt2  55  n2(3u+'°+"tl  mJ'1,1  o2  V°uv  .  ne2  t?^vD1  .  n2e2  t>2'vD2>  (4.2i.a) 

.  ^3utv+w  +  l  n3 .  .  ,  ..  . . 

$  .n  =  (1  P.(  h  .  a  )  (4.21.b) 

u.v.q+2  k  u.v  v 

where  h  =  0.5  if  u=v=0  and  h  =1.0  otherwise. 


TT 


p 


-  36  - 


The  3k  cells,  (i.q+2),  9</<8t3k.  in  row  q+2  have  basically  the  same  struc¬ 
ture.  each  is  a  multiplier/adder  equipped  with  a  demultiplexer  that  distributes  the 

results  to  the  output  links  p/+1  2  and  £*  ^  (see  figure  4.8  where  u  and  v 

/  -9 

equal  the  quotient  and  the  remainder  of  respectively).  Formally,  the  operation 
of  each  cell  (i.q+2)  is  described  by 


2  12 

P/*M*2  =  n  W 


/  -9-nr-H 


(i  .  p,  +  xp 


r 


Cu.qr+3  *  n  W/-9+wtl(p/  +  X/  '  ° 


(4.22.a) 
(4.22.b) 

where  A.  =  6/  q+2  ’  q +2  and  ,he  lnput  p9qt2  Js  Permanent,y  set  to  the  zer0 
sequence  i.  For  a  description  of  the  outputs  Cu  q+y  solve  (4.22)  using 
Lemma  2  in  the  Appendix.  This  yields 


n  «/4+w+i  ‘S 


l) 


'u.q+3 


1.2 
/  -9 
.1.2 


n  M/-9+w+l(In  X/-l  +  X/J 


i) 


n  w;r9+w+1<  ♦  n2x/_1  *  x.j 


t) 


i-9 

/  =  10  (4.23) 

/*!!.••  •  .8+3k 


where  by  (4.22)  and  (4.21).  A/  =  X3</tv+9  ,s  0lven  by 


x  .  rt3u  tv+w  tl  fp3 

\  ■  n  "*Vv 


•  n3utv  ml-’V,!’’  .  neV,D’  .  n  ezi,2y°2n 


«  ,p3 

k—uu.v 


-  n"  M,,’,’<e2,°v  .  ne27i^'vD1  . 


-v . 1 

Moreover,  with  the  help  of  property  PI 7.2  we  rewrite  this  as 


x/  “  x3utv+9 
where 


n6U«twtl  nv  n92ttl.vDl 


n2e2#2',02> 


(4.24) 


"1 


H 
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r.vDr  _  ^  ((vdr)i-l) 

u  u  ,v  y 

that  is  T(nr‘*)  =  Tirf'1)  and 
u  u 


r.vDr  _ 
Ju 


hu  ,v  ar.vDr  ^ u 


r  .vDr 


r=0.1.2 


rj,.,  a  r .! 

u  (t)  =  h  a  .  7i 
u  u  ,v  r.l  'u 


(4.25) 


Proposition  N5.1  :  With  the  input  described  by  (4.18)  and  (4.20).  the  intermediate 
sequences  ^3,  u=0. •  •  •  ,/t-l,  v=0,l,2,  are  given  by 


’u.qt  3 


6u  +v+w  +  l  .J.2rn3A2  .  2,2  2.1  2.0,  * 

n  Af1  (n  e  c^o  t  .  0  ) 

_6u  +v  +w  +1  .  .1.2  ...3-2  ,  1,2  1.1  1.0  * 

n  m1  (n°e‘  ^  +  ^^1  .  0  ) 


L  6utvtirtl  1-1.2  3  2  .  0.2  0,1  0.0,  * 

n  m1  (n  e  u„  +  n„  +  m„  )  .  0  ) 


for  v=0 

for  v  =  l  (4.26) 
for  v=2 


u  '“u  '  '"u 

where  we  extended  the  definition  of  such  that  nr_‘J  equals  the  zero  sequence. 


Proof  :  For  the  case  />11,  we  first  use  P5  to  rewrite  (4.24)  In  the  detailed  form 


x.  = 


X3u+v*9 


6u+w  .*1.1.1  ,-3a2  2.2  „2  0,0  2  2  1  1 

n  m,  (n  0  .  ne  n u  .  n  9  n’  ) 


1 


n6u+w+i  wi.i.i  (n3©2^'2  .  n4e2tfy'°  ,  n2©2^'1) 


C  fiu+w+2  M1.1.1^3a2  0.2  ^2. .1.0  X2..2.1 


n 


M 


irre^n 


n  0‘it 


,  n-e  mu-  ) 


for  v=0 
for  v  =  l 
for  v=2 


Then,  for  the  evaluation  of  X/_1  =  X3u+y'+9'  we  note  that  and 

hence  1- 1  =  3u+vt8  should  be  written  in  the  form  3(u-l)-t-2-t-9,  3u+0+9  and 
3u  +  l+9  for  v=0,l  and  2.  respectively.  With  these  forms  for  l-l  in  (4.24)  and  the 
help  of  P5  we  get 


_6  u+w 

r 

2  0,2 

• ns V-i  • 

2  2  10 
ne\w 

for 

O 

II 

> 

_6utw+l 

-  n 

.  nv«2-2 

.  n2e2»°’°> 

for 

V=1 

_n6utwt2 

<nV^’ 

.  n5e2*2°> 

for 

v=2 

Similarly,  we  write  i-2  as  3(u-l)+l+9,  3(u-l)t2+9  and  3ut0+9  for  v=0,l  and 


2.  respectively  and  get 
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I 


~6u+w  ..1. 1.1  ..3-2  2.0  ^2  0.1  2  1.2  „ 

n  (ne  nu .  <Wmu_7 


for  v=0 


n4x(_2  .{  n6"’"'4'  Mj-’-'tnV*;-®  .  nV^.',  .  nV^'f,)  .or  «=i 

,n6ot»«  MJ.l.lm3e2aO.O  nVMl.l  n592a2.2)  )or  „=2 

4  2 

Then  by  adding  these  three  formulas  to  get  fi  X/-2  +  fl  X^_j  +  by  and 
substituting  the  result  in  (4.23).  we  directly  obtain  the  equation  (4.26)  for  l<u  <*-l. 

The  case  u=0.  that  is  /  =9. 1 0  and  11.  can  be  analyzed  in  an  exactly  similar 
manner  yielding  the  result 


r 


’O.q+3 


n6u+w+v  +  l  MJ.2(n302  [/x2.2j  fl'} 


_6u+w+v  +  l  .  .1.2  ,03_2f  1.2  ^  1,1,  * 

n  m1  (n  e  iti  +  i  .  o  > 

(_~6u+w+v  +  l  .,1.2  .-3^2  ,  0.2  0.1  0.0,  * 

n  m}  in  e  in(J  +  nu  +  j  .  o  ) 

which  by  defining  =  t  may  also  be  put  In  the  form  (4.26). I 


for  v=0 
for  v  =  l 
for  v-2 


0  1  2 

Finally,  each  group  of  three  sequences  iu  g+3.  Cu  q+3  and  Cu  fl+3  is 
considered  as  input  to  a  cell  (u.qt 3).  0<u<X-l,  in  row  q+3  of  N5.  The  opera¬ 
tion  of  a  typical  cell  in  row  q+3  is  formally  expressed  by 


3f 


•  .  6u+w+2.3.1  „1.1.1  ,.0  1  .2 

tu.q+  4  =  0  lcu  '  A  M6u+w+2(^u,q+3  '  Cu.q+  3  '  ^u.q+3)] 

“  n  Icu  •  M6u+w+2 


*  *  2  0 

,(o  .o  .  trrc 


1  2 

u  ,q  +3  f  n,*u.q+3  +  ^u.q+3])  1 


(4.27) 


where  equals  to  2  for  u=0  and  to  1  otherwise. 

By  substituting  the  sequences  (4.26)  Into  the  network  description  (4.27)  we 
easily  find  the  description  of  the  output  sequences  as 


,  6u+w+7  2  - 

V«t4  *  "  9 


u=0.  •  •  •  ,X-1 


(4.28) 


where 
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K 


1 


-  39  - 


H 


u 


2 


Z 

I  =r 


+ 


2 


Z 

r  =  1 


r- 1 

z 

1=0 


r=0 


if  u  >0 


if  u  =0 


Using  the  definition  of  from  (4.25.)  in  (4.28)  and  comparing  the  result 

with  step  N5  in  algorithm  ALG1.  we  readily  prove  the  following  proposition: 


Proposition  N5.2  :  If  the  Inputs  to  the  network  N5  are  given  by  (4.18)  and  (4  20). 
then  the  network's  output  sequences  are  given  by 


,  _6 u+w+7  .2  — 

C„  ...  =  fl  6  (i 

u.q+  4  'u 


where  T (^)  =  k-u  and  nu(t)  =  H( 


u  =0.  •  •  •  ,k-l 


(4.28) 


Proposition  N5.2  states  that  after  an  initial  time  period  of  6u+3k+q+16  units, 
each  output  link  Cu  ^+4  will  carry  the  elements  of  the  uth  off-diagonal  of  the 
stiffness  matrix  Hr .  separated  from  each  other  by  2  time  units. 


To  summarize  the  behavior  of  the  entire  system,  we  show  in  Figure  4.9  a 

time  diagram  of  the  data  on  all  the  input  and  output  links  of  the  global  system, 
it  represents  a  translation  of  the  sequence  equations  (4.  s)  (4.20)  and  (4.28)  for 
the  special  case  k=3  and  q=3.  The  data  Items  in  the  input  sequences  4 ^  42< 

ff9g  +  l'  Pgqti  ancl  °9<7+i  dePend  on  the  finite  element  that  is  being  processed 
and  hence  they  must  be  provided  from  outside  the  system.  On  the  other  hand, 

the  data  in  p.)  .  g=l.***,q  do  not  depend  on  a  particular  finite  element  and 

thus,  as  mentioned  in  Section  3.  they  are  provided  from  a  memory  local  to  the 
system. 


in  general,  the  time  for  completing  the  computation  of  one  element  stiffness 
matrix  is  9k+q+10  time  units.  In  the  next  section,  we  will  prove  that  the  computa- 
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5. Verification  of  Pipelined  Operation 


For  a  given  systolic  network  that  has  been  shown  to  perform  successfully  a 
certain  computation,  we  want  to  study  the  issue  of  repeating  the  same  computation 
on  different  data  in  a  pipelined  fashion.  Assume  that  a  certain  systolic  network  N 
has  the  I/O  description 

7?,  =  r.<£r  •  •  -  .ln)  /  =  !,•  •  *.p  (5.1) 

where  i ^  /=l.***.n  and  t)f.  /=l,***.p  are  the  Input  and  output  sequences  of 
the  network,  respectively,  and  iy  /=!,••  *,p  denote  certain  causal  operators  that 
model  the  behavior  of  the  network.  Suppose  also  that  for  a  certain  input  descrip¬ 
tion 

ij  =  nr/  a,  /  =  (5.2) 

with  given  integers  r/  and  sequences  ay  we  were  able  to  show  that  the  outputs 
are  described  by 

Vj  =  nSI  0j  /  =  !.•• -,p  (5.3) 

with  certain  integers  si  and  sequences  0r  That  is.  in  other  words,  suppose  that 
when  (5.2)  is  used  in  the  equations  (5.1),  then  we  were  able  to  prove  that 

("Is  @1  =  rxnr  1  at.j.  •••  ,nrn  atfl)  /=l,***,p  (5.4) 

The  calculation  of  the  elements  of  0..  /=l.***.p.  from  those  of 
/' =1 . •••,/»  using  the  network  N  shall  be  called  the  computation  *C".  The  time  of 
this  computation  is  defined  as  the  time  required  by  N  to  complete  C  from  the 
moment  when  the  first  non-0  Input  entered  N  to  the  moment  when  the  last  non-0 
output  was  produced.  More  precisely. 

Time  (C )  =  max(  T  inSI  0.):  1  </<p/  -  mlnC  r/ :  1  <i<n)  (5.5) 

where  T  is  the  termination  function  defined  in  Section  2. 
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Often,  it  is  desirable  to  repeat  the  computation  C.  say  m  times,  with  dif¬ 
ferent  data  sets  A®  =  (cr®;/ =1.  •  •  *n).  e=l.***,m.  Let  us  denote  these  m  instances 
of  C  by  C®.  e=l  .  •••.m.  In  many  networks,  this  may  be  accomplished  by  pipe- 

1  m 

lining  C  .  C  The  time  difference  between  the  initiations  of  two  succes¬ 

sive  instances  C®  and  Cet1  will  be  defined  as  the  pipe  separation  r  of  the 
computation  C.  In  this  case,  the  inputs  for  the  different  instances  of  C  should 
be  pipelined  on  the  input  links.  That  is  equation  (5.2)  for  the  input  sequences 
should  be  replaced  by 


t*  =  nr'  PT  (a®) 
/  e  =  1.m  / 


/  =  ! .  •  •  •  ,n 


where  we  used  the  asterlx  In  to  Indicate  that  the  sequences  represent  the 
input  data  during  the  pipeline  operation.  We  will  also  use  l®  to  represent  the 
inputs  (5.2)  for  a  specific  instance  C®  of  the  computation.  This  *  and  e  super¬ 
script  notations  will  be  used  in  the  remainder  of  this  section  for  sequences  on 
any  communication  link. 

If  the  computation  can  be  successfully  pipelined  on  N  with  a  separation  r. 
then  by  using  the  inputs  (5.6)  in  the  network  I/O  description  (5.1),  we  should  be 
able  to  prove  that  the  output  sequences  during  pipelined  operation  are  described 


* n*'  p8T=i.*X> 


in  order  to  ensure  a  successful  pipelined  operation,  the  pipe  separation  r 
must  be  large  enough  so  that  the  inputs  of  the  different  Instances  C®  do  not 
overlap  and  the  corresponding  outputs  do  not  overwrite  each  other.  The  first 
condition  implies  that  r>T(a®).  /=1.  •••./!.  and  the  second  that  t>T(£®). 

».p.  In  other  words,  the  minimum  pipe  separation  rm(C)  for  the  computa¬ 
tion  C  is  equal  to  the  maximum  span  of  all  the  input  and  output  sequences  in  C. 
where  the  span  of  a  sequence  is  defined  as  the  time  difference  between  the  first 
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and  the  last  non  O-elements  in  the  sequence  plus  l.  that  is  the  time  during 
which  the  sequence  carries  imformation  relevant  to  the  computation.  Hence,  from 
the  viewpoint  of  pipeline  operation,  a  network  that  can  be  used  to  pipeline  a 
computation  C  with  a  pipe  separation  r^fC >  achieves  maximum  efficiency. 


in  order  to  prove  (5.7)  from  (5.6)  and  (5.D  without  repeating  the  effort 
spent  in  deriving  (5.4).  we  use  the  negative  shift  operator  and  the  equation  (5.2) 
to  rewrite  the  pipelined  input  (5.6)  as 


«/  *  n"  Pl.y,m  <n‘"  «?>  '*’•••••»  IS.® 

where  €*  are  the  inputs  that  would  be  used  if  the  Instance  Ce  of  C  had  been 

performed  on  N  without  any  pipelining.  Next,  we  substitute  (5.8)  into  the  network 
I/O  description  (5.1)  and  obtain  for  /  =  !,••  •  .p 


v,  =  rtm 


r  1 


e  =1  .m 


<cfrl  €*)].« 


.in 


rn 


pI-i  m(n’rn«!>» 

e  =  i.m  n 


(5.9) 


The  remainder  of  the  proof  is  based  on  the  use  of  the  different  properties 
in  the  Appendix  for  factoring  the  shift  and  the  piping  operators  out  of  the  causal 
operator  IV.  if  the  computation  can  be  successfully  pipelined  through  N.  then  we 
should  be  able  to  transform  (5.9)  Into  the  form 


-  ns/  PT 


e=l.m(  > 


I- 1.  •  •  •  ,p 


(5.10) 


which  by  (5.1)  and  (5.3)  directly  reduces  to  (5.7). 


it  should  be  noted  however  that  there  exist  computations  for  which  there  Is 

no  value  for  r  for  which  (5.10)  is  derivable  from  (5.9)  which  means  that  the  com¬ 

putation  can  not  be  pipelined.  On  the  other  hand,  we  can  identify  a  class  of 
computations  for  which  pipelining  is  always  possible.  We  use  the  term  ’Inert*  to 
identify  computations  in  this  class,  in  other  words,  a  computation  C  on  a  systolic 
network  N  is  called  inert  if  it  has  the  following  two  properties 

l)  At  its  initiation.  C  does  not  care  about  the  data  on  the  non  input 


communication  links  of  N,  that  is  we  may  assume  that  at  time  t=l.  the  data 
in  any  non  input  sequence  are  0's.  This  implies  that  any  delay  in  N  should 
be  modeled  using  the  shift  operator  and  not  the  zero  shift  operator. 

2)  Only  0-regular  operators  are  used  for  modeling  the  cells  in  N.  This 
implies  that  the  network  does  not  treat  0  as  a  special  symbol. 

it  is  always  possible  to  pipeline  an  inert  computation  C  through  the 
corresponding  network  N.  In  fact  we  may  simply  chose  the  pipe  separation  r  to 

Afl 

be  the  time  of  the  computation  as  defined  by  (5.5).  With  this  value  of  r.  C 
does  not  start  before  Ce  is  terminated.  Of  course,  we  are  not  Interested  in  such 
large  values  of  r.  and  hence,  the  problem  arises  of  finding  the  least  value  of  r 
for  which  (5.10)  is  derivable  from  (5.9). 

As  should  be  clear  from  the  above  discussion,  the  ability  to  derive  (5.10) 

from  (5.9)  is  the  major  Issue  In  verifying  the  pipeline  operation  of  any  systolic 

network,  and  this  ability  depends  principally  on  the  value  of  r.  However,  for  any 

inert  computation  C.  we  know  that  there  exist  a  value  for  which  (5.10)  is  derivable 

from  (5.9).  In  order  to  find  the  least  possible  r.  we  start  with  r  =  r  (C)  and 

m 

proceed  to  factor  out  the  shift  and  piping  operators  from  (5.9)  until  we  either 
reach  (5.10).  which  is  our  goal,  or  we  cannot  continue  the  factorization  due  to 
our  small  value  of  r.  In  the  latter  case,  we  increase  r  appropriately  and  repeat 

the  derivation  procedure. 

For  ail  the  networks  presented  in  Section  4,  where  all  the  computations  are 
inert  and  the  maximum  span  of  all  input  and  output  sequences  is  3k.  it  can  be 

proved  by  the  above  technique  that  the  m  instances  of  the  computation  of  the 
stiffness  matrices  can  be  pipelined  through  the  system  with  a  separation 

r  =  rm  =  3/c.  hence,  the  entire  system  can  be  used  to  generate  the  m  stiffness 
matrices  in  a  time  equal  to  f+3(/n-l)k,  where  f  =9k+q+10  is  the  time  consumed 
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by  the  first  instance  of  the  computation,  in  order  to  illustrate  the  derivation  pro¬ 
cedure.  we  will  apply  it  to  the  verification  of  the  pipeline  operation  of  the  subnet¬ 
work  N4. 


5.1.  Pipeline  verification  of  N4 


Before  starting  our  verification  procedure,  we  recall  that  in  section  4  the 
network  t/O  description  of  N4  was  found  to  be  given  by  equation  (4.16).  which  is 


Wi  *  C  n'-atq-°  in'”  *9e  -  p9o, 


/  =9.  •  •  •  ,3k  +8  (5.11) 


Moreover,  when  the  inputs  for  a  certain  instance  C®  of  the  computation  are 


ir*  =  n0+3*+a  ? 

9  g  g 

e  g  +3k  +8  e 

p9.g  =  n  vg 

then  the  outputs  are  given  by 


,e  _2/+3k+g-9  e 

■ n  »< 


p  =  1 ,  •  •  • 

(7  =  1.  •  •  •  .q 


(5.12.a) 


(5.12.b) 


I  =9.  •  •  •  ,3k +8  (5.12.C) 


where  the  detailed  forms  of  the  sequences  and  containing  the  input  data 
for  C®.  and  the  sequences  17*  containing  the  results  of  C®  are  specified  by 
(4.12)  and  (4.18),  respectively.  For  the  following  discussion,  we  do  not  need 

these  detailed  forms,  it  suffices  to  know  that  T(v®)  =  T(v®)  =3k  and  TO??)  =  3k- 

g  g  1 

(i-9).  and  hence  that  the  minimum  pipe  separation  is  rffl  =  3k. 

if  the  computation  C  is  pipelined  through  N4  with  a  separation  of  3k.  the 
inputs  should  have  the  form 


n° t3*  *8  P3*  m<vJ) 
o  =  1  .m  g 


e  =  i.m  g 


-  n» ,3*  ,8  p3‘_ .  m  in"19 t3* t8>  »•  )  (5.  i3.« 

e  =  i.m  9,0 


n9t3*t8  p3*L,  _<n'<9,3*t81  p»  > 

e  =  i.m  y.g 


(5.13.0) 


46 


Using  this  in  the  network  I/O  description  (5.11).  we  get  the  pipeline  outputs 


in  the  form 


'I  ,o+l 


£  n'-8t"-0  in'*9  nst3*,s  p3\  „•  ,  . 

^  =  1  e=l.m  9.0 

n9t3*ta  p»*i,m<n'<9t3*+8>  p*  »  (S.i.» 


Now.  by  properties  PI  and  P8  in  the  Appendix  we  obtain 

,*  -  rt/+3*+<*  #j3/c  .  -U+3k+q)  £  /  -8to  -a  /  -9  e  e 

C/.«*l  *  n  p..iV  n  E  n  m  "9.8  '  P9.e’  ’  <5,5> 


which  by  (5.11)  and  (5.12.C)  reduces  to 


r  _  (J  p3^  //si  “9  _e% 

S.g+1  ‘  n  Pe=l.m(n  V 


(5.16) 


Finally,  because  of  Tw!~9ii°)  =  i-9+T(7?®>  =  k.  we  use  P8  to  write  (5.16) 


«  .  2/  +3k  +<7  -9  3A  e> 

S.d  +  1  "  n  Pe  =  l.m(V 


which  proves  that  the  sets  of  results  IV/ ;  /=9.  •  •  •  ,3*t8)  of  the  different  instances 
e=l,***,/n  will  be  correctly  produced  at  the  rate  of  ^  If  the  set  of  inputs 

(v°g  .  v®;  0=1.*  •*.g}  are  pumped  through  N4  at  the  same  rate. 


* 

i 

I 
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6.  Concluding  Remarks 

This  paper  demonstrates  the  power  of  the  extended  systolic  model  by 
applying  it  to  the  specification  and  formal  verification  of  a  systolic  system  that 
can  pipeline  the  computation  of  the  elemental  stiffness  matrices. 

There  were  no  difficulties  In  establishing  analytical  proofs  for  the  opera¬ 
tion  of  the  different  components  of  our  system.  The  reason  for  that  may  be 
the  absence  of  feed  back  loops  and  the  fact  that  our  system  does  produce 
what  we  called  an  inert  computation.  However,  an  analytical  verification  of  a 
systolic  network  is  not  always  possible,  and  any  conditions  under  which  a  net¬ 
work  is  analytically  verifiable  using  our  model  are  as  yet  still  unknown.  In  part 
due  to  our  incomplete  sequence  algebra.  As  a  means  for  alleviating  this 
problem,  a  computer  program  was  developed  that  solves  Iteratively  any  system 
of  consistent  causal  equations.  This  solver  may  be  used  in  the  verification  of 
particular  instances  of  computations  whenever  analytical  verifications  are  not 
possible.  The  details  of  this  solver/simulator  will  appear  elsewhere. 

Although  the  abstract  model  has  been  used  here  to  specify  the  archi¬ 
tecture  at  the  level  of  the  computational  cells,  the  same  moojl  can  also  be 

i 

used  for  lower  or  higher  levels  of  architectures  provided  we  define  appropri¬ 
ately  the  domain  of  the  data  items  that  are  transmitted  on  the  communi¬ 
cation  links  of  the  network,  and  the  corresponding  operators. 

Besides  its  value  in  demonstrating  the  power  of  the  systolic  model,  the 
system  that  generates  the  elemental  stiffness  matrices  appears  to  have  merit  of 
its  own.  in  fact,  it  Is  a  contribution  to  the  design  of  an  integrated  systolic 
finite  element  machine.  For  the  implementation  of  any  such  machine,  two 
alternatives  may  be  considered:  1)  We  may  use  a  systolic  network  similar  to 
the  one  proposed  in  1221  to  assemble  the  global  stiffness  matrix  and  then 
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apply  one  of  many  systolic  networks  suggested  in  the  literature  13.23.25)  for 
solving  the  resulting  linear  systems  of  equations.  2)  Or  we  may  use  the  sys¬ 
tem  described  in  this  report  in  a  larger  system  that  employs  an  iterative 
scheme  for  completing  the  finite  element  analysis.  Further  research  is  needed 
to  assess  the  merits  of  the  two  approaches  and  to  determine  the  global  confi¬ 
guration  of  the  system. 

in  addition  to  being  adequate  for  VLSI  implementation,  the  design 

presented  in  this  report  has  the  Important  advantage  of  being  modular  in  the 

sense  that  if  the  system  is  designed  tor  a  specific  value  of  k  (element  type) 
and  q  (quadrature  formula),  it  can  be  easily  modified  to  perform  the  analysis 
for  different  values  of  k  and  q.  Of  course  the  design  is  independent  of  the 
finite  element  mesh  or  the  number  of  elements  m  In  this  mesh. 

We  have  shown  that  for  the  general  class  of  problems  described  in 

section  3.  the  computation  of  the  elemental  matrices  is  completed  in  approxi¬ 
mately  3km  time  units.  However,  a  careful  examination  of  the  design  shows 

that  this  time  may  be  reduced  to  (2k+l)m  for  some  special  problems  in  which 
the  coefficients  a®  f  are  equal  to  zero  for  r=0  or  1=0.  Examples  of  this 

important  class  of  problems  are  the  heat  flow,  the  plain  strain  and  plain 

stress  problems  (9).  To  obtain  this  reduction  in  time,  some  control  parame¬ 
ters  have  to  be  changed  as  well  as  the  forms  of  the  input  sequences.  At 

this  point  we  note  that  with  the  technique  described  in  section  5.  it  can  be 

proved  that  a  successful  pipelining  of  the  operation  on  the  modified  network 
requires  a  pipe  separation  r  equal  to  2k-*- 1 .  This  is  larger  than  the  minimum 

pipe  separation  Tm~2k  ,or  the  computation,  which  means  that  the  modified 

network  cannot  operate  at  maximum  pipeline  efficiency. 
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Finally,  we  note  that  it  is  not  simple  to  define  a  measure  that  estimates 

'  TP 

the  efficiency  of  systolic  networks.  An  Intuitive  measure  would  be  where  T 
is  the  time  needed  by  a  systolic  network  to  complete  the  computation.  P  is 
the  number  of  computational  cells  in  the  network,  and  C  is  the  number  of 
operations  to  be  computed  by  the  network.  This  measure,  however,  does  not 
take  into  consideration  the  type  of  operation  performed  by  a  cell,  which  ranges 
in  our  case  from  simple  memory  cells  to  floating  point  dividers.  It  also 
ignores  the  benefit  obtained  by  the  regular  movement  of  the  data  in  the  net¬ 
work.  in  [26]  the  authors  suggested  a  more  elaborate  measure  that  takes  into 
account  the  band  width  of  the  input  and  output  links  in  the  network  in  com¬ 
paring  the  efficiency  of  the  different  systolic  networks.  Both  measures  estimate 
the  utilization  of  the  computational  cells  In  a  network  without  differentiating 
between  the  different  types  of  cells.  This  is  acceptable  if  all  the  cells  in  the 
network  are  of  the  same  type.  However,  if  the  network  contains  more  than  one 
type  of  cells,  as  is  the  case  with  our  system,  we  believe  that  the  utilization  of 
each  cell  should  be  multiplied  by  a  weight  that  reflects  the  hardware  complex¬ 
ity  of  the  different  cells.  More  work  is  needed  to  develop  an  efficiency  meas¬ 
ure  of  this  type. 
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Appendix 

In  this  appendix  we  list  some  properties  about  combinations  of  the  different 
operators  defined  in  this  report.  All  the  properties  are  directly  verifiable  from  the 
definition  of  the  operators  and  are  very  useful  in  simplifying  any  manipulation  of 
the  sequence  expressions.  It  should  be  noted  that  the  zero  shift  operator  is  not 
included  in  any  property.  This  is  due  to  the  fact  that  it  was  not  used  at  all  in 
modeling  the  networks  presented  in  the  report. 

Most  of  the  properties  take  the  form  "sequence  expression  =  sequence 
expression".  However,  some  have  the  form  *  sequence  expression  -•  sequence 

expression",  where  we  formally  define  the  implication  operator  -  as  follows: 

IF  for  any  t  either  i )(t)=t(t)  or  7)(f)=0  THEN  l-i) 
that  is  7j  is  equal  to  Z  after  replacing  some  of  its  elements  by  0.  Consequently, 

if  t-T).  then  we  may  replace  Z  by  y  in  any  sequence  expression  as  long  as  0  is 

treated  as  a  don't  care  and  not  as  a  special  symbol,  that  is  in  the  contest  of 

inert  computations.  Of  course,  if  Z-V  and  tf-Z  then  Z=V- 


PI)  For  any  element-wise  operator  'op'  with  0  op'  0=0  we  have 

1.1)  For  r  =  n.  6.  E  or  P 


n*)  op'  r<7j)  =  r<*  'op'  7 1) 


1.2)  M 


w  1 . wn 


(«r ••••«„)  0P'  Mr 


,w  1 . wn 


171,.-  ••.»„)  = 


.w  1 . wn 


'op' 


.!<„  ’OP'  P„l> 


1.3)  As  a  direct  result  of  PI. 2  we  have 


,  , _ ,  . .w  1  ....wn  .  .  .,w  1....wn  , 

{  op  Mr  C77 1 .  •  •  •  ,7Jn)  =  ([(  op  7).j]  .  •••.  op  7Jnl) 

1.4)  if.  in  addition,  op'  is  a  0-regular  operator  then 

nr  z  op'  77  =  nr  < 


where  T(C>  =  min(T(7))-r .T (£))  and  Z(t)  =  Ht)  op'  7)(f+r) 
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P2)  For  the  scalar  multiplication  operator  it  follows  that 

2.1)  For  r  =  n.  0.  E  or  P 

w  .  m)  =  rcw  .  o 

n  kAw  1  ...,wn ,  .  ..w  l,...wn ,  .  ,, 

2.2)  w  .  Mr  (7)  ^ ,  •  •  •  .T)n)  =  Mr  (tw  .  7?1J  .  .  [m<  .  77^]) 

P3)  Composition  of  n  with  itself 

3.D  n  n  i  =  n2  i 

3.2)  n“ 1  n  i  =  i 

3.3)  n  rf1  £  =  i  if  and  only  if  £(1)=0 

3.4)  i  -  n  rf 1  e 

P4)  Composition  of  n  with  6 

Qr  Qk  i  =  n(r  +  D*  Qr  i  for  r>0  and  any  k 


PS)  Composition  of  n  with  M 

.  ..  ./I . wn., 

5.D  n  m r 


,  _v  _  .Mw  l....wn 

5.2)  n  Mr  (4r 

5.3)  n  (£r 


■V  = 

for  any  rand  s  >  -r 
w  1  ,...wn 


■V  =  Mr  (fl«n  •  n«l 

..4n)  =  Af^1  •" 


nW 


where  k  =  w,  t  • 


+w_ 


5.4)  M,t, 

5.5) 


,  ,  r  -r  .,wl...,wn,, 
■•«»)  -  n  n  “rti  «>• 

n  * 


1  ■  "n 

rtl  "1 

r--*.n<7  n‘<7{/ 

where  q  = 


•V 


P8)  Composition  of  0  with  E 

••’>  n*  <  - nS  <  « 

6.2)  ejt)  {  =  nr  rf'  { 

6.3)  e1}  nk  i  -  nk  E*r  i 


( 

'  ,*i 

1 


-  .  I 


-  3 

«.j 


■  I 
-  /2#d 


.1 

61 


for  any  r  and  5  >  r 
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P7)  Composition  of  n  with  A 

7.D  Ar-k-s  n"  £  =  n"  ( 

7.2)  a”'-*-5  {  =  nr  rfr  Arf'*s  t 


for  u  <  r 


Pfl)  Composition  of  fl  with  P 

8.D  nr  p*  m<«®)  -  p*  1m(nr  *®) 

e=i. m  '  e=i. m 

8.2)  ^  „,«®> 
0=1.  /n  0  =  1, m 

8.3)  rfr  P*  <f*>  -  P*  m(rfr  *®> 

e  =  1  ,m  *  0=1. m  % 

8.4)  i®)  -  rfr  p*..  c€®> 

0-1,/d  0  =  1, m 


8.5)  -  n*  pj,  ,(^) 

m  m  —  i 


P9)  Composition  of  6  with  itself 

er  ek  l  =  ek  er  t  =  ekrfk+r  i 


if  T(C)  <  k-r 


if  7(0  <  it 

if  n~r  nf  i®  -  *® 


P10)  Composition  of  6  with  £ 

e*-’  t  -  a1'1  t 


P11)  Composition  of  6  with  A 


a'  *  s  e5'1  £  =  es~'  a1*-'  e 


.S-l  .l.it.l 


P12)  Composition  of  6  with  P 

A**  .  (9s  _l  £S)  =  9s-'  A*  .  <£e) 

0=1,  m  0  - 1  ,/d 


PI 3)  Composition  of  M  with  itself 


13.1)  if  (7Jr  •  •  • 


13.2)  •  V‘V  = 

<rmJ 


for  m  +n  <  it 
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P14)  Composition  of  M  with  A 


E 

14.1) 

t!** 

*  ,  Ar.k.s 

•  -*v  =  * 

sr 

for  Kr<5 

14.2) 

a'*-' 

M1 

r 

<)),.•• --V 

where 

’/ 

/ 

=  E 

U  =  1 

PI  5)  Composition  of  E  with  P 

P*  IF9)  =  P*  (£*  £* 

1  e  =  l.ml*  o=l. m  l  * 

PI  6)  Composition  of  A  with  P 

_n/c  ,,8,  _  ..1 


P?*  m(«9)  =  M1*'1  £' 

e=l.  m  o  =  l.  m 


P17)  Other  properties  involving  the  multiplication  operator  *** 

17.1)  £^+1  £  -  7)  -  UKr  +  1)  .  nfn~f  7)  if  T (7))  <  fctr 

17.2)  rf  ••.<„.,)  *  p>>  -  n1  *;■ 

where  C.  =  77((/Dnr)+l)  .  £r  .  r  =0.1 . » •  ♦./»—!  and  on  Is  the 
modulo  t  dltion  operation  on  integers. 

Next,  we  state  two  lemmas  that  can  be  proved  using  the  above  properties 

Lemma  1  :  The  system  of  difference  equations 


V 1  =  n  *0  + 


g=l.  •  •  •  .k 


has  the  solution 


«r.l  *  n'  {1  +  E,  ^  *1 


Lemma  2  :  The  system  of  difference  equations 

O/.i  '  •  IX,  ♦  P/“ 

</  - n  '«<"■)  *  P/1  •  ‘> 

with  the  condition  Pr0  =  1  has  the  solution 


r  =  l.  •  •  •  ,k 


,=ro •• 
/=f0- 
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